Abstract

In this paper, we present methods in deep multimodal learning for fusing speech and visual
modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach
where uni-modal deep networks are trained separately and their final hidden layers are fused
to obtain a joint feature space in which another deep network is built. While the audio network
alone achieves a phone error rate (PER) of 41% under clean conditions on the IBM large vocabulary
audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating
the tremendous value of the visual channel in phone classification even for audio with a high
signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax
layer to account for class-specific correlations between modalities. We show that combining the
posteriors from the bilinear networks with those from the fused model mentioned above results
in a further significant phone error rate reduction, yielding a final PER of 34.03%.


1 Introduction 

Human speech perception is not only about hearing but also about seeing: our brain integrates
the waveforms representing the speech information as well as the lip poses and motions, often
called visemes, which carry important visual information about what is being said. This has been
demonstrated by the so-called McGurk effect [MM76], in which a voicing of ba paired with a
mouthing of ga is perceived as da. In the presence of noise and multiple speakers (cocktail
party effect), humans rely on lip reading in order to enhance speech recognition [CHLN08]. The
visual information is also important in a clean speech scenario, as it helps in disambiguating voices
with similar acoustics [Sum92].

In Audio-Visual Automatic Speech Recognition (AV-ASR), both audio recordings and videos of
the person talking are available at training time. It is challenging to build models that integrate
both visual and audio information and that enhance the recognition performance of the overall
system. While most previous work in AV-ASR focused on enhancing the performance in the noisy
case [PNLM04, NKK+11], where the visual information can be crucial, we focus in this paper on
showing that the visual information is indeed helpful even in the clean speech scenario.

*This work was done while Youssef Mroueh was an intern in the Speech and Algorithms Group at IBM T.J. Watson
Research Center.

Multimodal learning consists of fusing and relating information coming from different sources,
hence AV-ASR is an important multimodal problem. Finding correlations between different modalities,
and modeling their interactions, has been addressed in various learning frameworks and has
been applied to AV-ASR [Gea09, LS06, MHD96, PKPM07, PKPM09, PKPM06]. Deep Neural
Networks (DNN) have shown impressive performance in both audio and visual classification tasks,
which is why we restrict ourselves to the deep multimodal learning framework [YGS89, NKK+11,
SS, SGMN, AALB13].

In this paper, we propose methods in deep learning to fuse modalities, and validate them on
the IBM AV-ASR Large Vocabulary Studio Dataset (Section 2). First we consider the training of
two networks on the audio and the visual modality separately. Then, treating the last hidden layer
of each network as a better feature space and concatenating the two, we train a classifier on that
joint representation and obtain gains in Phone Error Rate (PER) with respect to an audio-only
trained network. We then propose a new bilinear network that accounts for correlations between
modalities and allows for joint training of the two networks. We show that a committee of such
bilinear networks, fused at the level of posteriors, achieves a better PER in a clean speech scenario.

The paper is organized as follows. In Section 2 we present the IBM AV-ASR large vocabulary
studio dataset and our feature extraction pipeline for the audio and the visual channels. Next, in
Section 3, we present results for the fusion of networks separately trained on each modality. In
Section 4 we introduce the bilinear DNN that allows for joint training and captures correlations
between the two modalities, and we derive its back-propagation algorithm in Section 5. Finally, we
present the posterior combination of bimodal and bilinear bimodal DNNs in Section 6.

2 Audio-Visual Data Set & Feature Extraction

In this Section we present the IBM AV-ASR Large Vocabulary Studio dataset, and our feature 
extraction pipeline. 

2.1 IBM AV-ASR Large Vocabulary Studio Dataset 

The IBM AV-ASR Large Vocabulary Studio Dataset consists of 40 hours of audio-visual recordings
from 262 speakers. The recordings were carried out in clean, studio conditions. The audio is sampled
at 16 kHz, and the video is recorded at 30 frames per second at 704 x 480 resolution. The
vocabulary size in these recordings is 10,400 words. This data set was divided into a test set of 2
hours of audio+video from 22 speakers, with the rest used for training.

2.2 Feature Extraction 

For the audio channel we extract 24 MFCC coefficients at 100 frames per second. Nine consecutive
frames of MFCC coefficients are stacked and projected to 40 dimensions using an LDA matrix.
The input to the audio neural network is formed by concatenating ±4 LDA frames to the central frame
of interest, resulting in an audio feature vector of dimension 9 x 40 = 360.
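The context stacking described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `stack_context` is hypothetical, the input frames are random stand-ins for 40-dimensional LDA frames, and repeating the first and last frame at the utterance edges is an assumption, since the text does not say how edges are handled.

```python
import numpy as np

def stack_context(frames: np.ndarray, context: int = 4) -> np.ndarray:
    """Concatenate each frame with its +/-context neighbours.

    Edges are padded by repeating the first/last frame (an assumption).
    """
    num_frames, _ = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    # Window t covers padded[t : t + 2*context + 1], i.e. original frames t-context .. t+context.
    windows = [padded[t:t + 2 * context + 1].reshape(-1) for t in range(num_frames)]
    return np.stack(windows)

# Random stand-in for one second of 40-dimensional LDA frames at 100 frames per second.
lda_frames = np.random.default_rng(0).normal(size=(100, 40))
audio_input = stack_context(lda_frames, context=4)
print(audio_input.shape)  # 9 frames x 40 dimensions per row
```

The same function applied to 60-dimensional visual frames gives vectors of dimension 9 x 60 = 540.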

For the visual channel we start by detecting the face in the image using the OpenCV implementation
of the Viola-Jones algorithm. We then carve out the mouth region using an OpenCV mouth detection
model. Both of these use the ENCARA2 model as described in [MDHL11]. In order to get a
representation that is invariant to small distortions and scales, we then extract level 1 and level 2
scattering coefficients [BM13] on the 64 x 64 mouth region of interest and reduce their dimension to 60
using LDA (Linear Discriminant Analysis). In order to match the audio frame rate we replicate
video frames according to audio and video time stamps. We also add ±4 context frames to the
central frame of interest, and finally obtain a visual feature vector of dimension 9 x 60 = 540.
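One way to replicate video frames so that they match the audio frame rate is sketched below. The text only says that frames are replicated according to the time stamps, so the rule used here (each audio frame takes the most recent video frame) and the name `align_video_to_audio` are assumptions made for illustration.

```python
import numpy as np

def align_video_to_audio(video_feats, video_times, audio_times):
    """For each audio frame, pick the latest video frame whose timestamp is not after it."""
    idx = np.searchsorted(video_times, audio_times, side="right") - 1
    idx = np.clip(idx, 0, len(video_times) - 1)
    return video_feats[idx]

video_times = np.arange(30) / 30.0    # 30 frames per second, one second of video
audio_times = np.arange(100) / 100.0  # 100 audio feature frames per second
# Random stand-in for 60-dimensional visual features after LDA.
video_feats = np.random.default_rng(0).normal(size=(30, 60))

aligned = align_video_to_audio(video_feats, video_times, audio_times)
print(aligned.shape)  # one visual feature vector per audio frame
```

With both streams at 100 frames per second, every audio frame has a visual counterpart, which is what allows a single label per audio+video frame.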

2.3 Context-dependent Phoneme Targets 

Each audio+video frame is labeled with one of 1328 targets that represent context-dependent
phonemes. 42 phones in a phonetic context of ±2 are clustered using decision trees down to 1328
classes. We measure the classification error rate at the level of these 1328 classes; this is referred
to as the phone error rate (PER).

3 Uni-modal DNNs & Feature Fusion

In the supervised multimodal scenario, we are given a training set $S$ of $N$ labeled examples, and
$C$ classes:

$$S = \{(x_i^1, x_i^2, y_i),\ i = 1 \dots N\}, \quad y_i \in \mathcal{Y} = \{1 \dots C\},$$

where $x_i^1$ and $x_i^2$ correspond to the first and the second modality feature vectors, respectively.
We note $t_i = e_{y_i}$ the classification targets, where $\{e_y\}_{y \in \mathcal{Y}}$ is the canonical
basis in $\mathbb{R}^C$. Let $p(y|x^1, x^2)$ be the posterior probability of being in class $y$ given the
two modalities $x^1$ and $x^2$. In a classification task, we would like to find the model that maximizes
the cross-entropy $\mathcal{E}$:

$$\mathcal{E} = \sum_{i=1}^{N} \sum_{y=1}^{C} t_i^y \log p(y|x_i^1, x_i^2). \qquad (1)$$
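As a small worked example of objective (1) and of the frame-level error rate used throughout, the sketch below evaluates both on made-up posteriors. The function names and the toy numbers are illustrative assumptions; with one-hot targets the double sum reduces to the log-posterior of the labeled class for each frame.

```python
import numpy as np

def cross_entropy_objective(log_posteriors, labels):
    """Sum over frames of log p(y_i | x_i^1, x_i^2), i.e. objective (1) with one-hot targets."""
    return float(np.sum(log_posteriors[np.arange(len(labels)), labels]))

def frame_error_rate(log_posteriors, labels):
    """Fraction of frames whose most probable class differs from the label."""
    return float(np.mean(np.argmax(log_posteriors, axis=1) != labels))

# Toy example with N = 3 frames and C = 4 classes (values are made up).
posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.2, 0.5, 0.2, 0.1],
                       [0.4, 0.3, 0.2, 0.1]])
labels = np.array([0, 1, 2])

E = cross_entropy_objective(np.log(posteriors), labels)  # log 0.7 + log 0.5 + log 0.2
per = frame_error_rate(np.log(posteriors), labels)       # only the third frame is misclassified
print(E, per)
```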

The first multimodal modeling approach we study is to train two separate networks, DNNa and
DNNv, on the audio and the visual features, respectively. The networks are optimized under the
cross-entropy objective (1) using stochastic gradient descent. We form a joint audio-visual
feature representation by concatenating the outputs of the final hidden layers of these two networks,
as shown in Figure 1. This feature space is then kept fixed while a deep or a shallow (softmax
only) network is trained in this fused space up to the targets. To keep the feature space dimension
manageable, we configure the individual audio and video networks to have a low-dimensional final
hidden layer.




Figure 1: Bimodal DNN.

We consider for DNNa and DNNv the following architecture: dim/1024/1024/1024/1024/1024/200/1328,
where dim = 360 for DNNa and dim = 540 for DNNv. The fused feature space dimension is 400.
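The shallow (softmax only) fusion can be sketched as below. This is a shape-level illustration under stated assumptions: the two uni-modal networks are stand-ins with random weights and a single small hidden layer of 64 units (not the trained 1024-unit networks), and the helper names `final_hidden` and `random_net` are hypothetical. Only the 360, 540, 200, 400 and 1328 dimensions come from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def final_hidden(x, weights, biases):
    """Forward pass through sigmoid hidden layers; returns the last hidden activation."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

def random_net(sizes):
    """Random stand-in for a trained network (illustration only)."""
    Ws = [rng.normal(scale=0.05, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(n) for n in sizes[1:]]
    return Ws, bs

# Frozen uni-modal feature extractors ending in a 200-dimensional hidden layer.
audio_W, audio_b = random_net([360, 64, 200])
video_W, video_b = random_net([540, 64, 200])

x_audio = rng.normal(size=(8, 360))   # 8 frames of audio features
x_video = rng.normal(size=(8, 540))   # the 8 matching frames of visual features

# Joint representation: concatenation of the two final hidden layers (200 + 200 = 400).
fused = np.concatenate([final_hidden(x_audio, audio_W, audio_b),
                        final_hidden(x_video, video_W, video_b)], axis=1)

# Shallow fusion: a single softmax layer on top of the fixed fused features.
W_out = rng.normal(scale=0.05, size=(400, 1328))
b_out = np.zeros(1328)
posteriors = softmax(fused @ W_out + b_out)
print(fused.shape, posteriors.shape)
```

In the deep fusion variant, the single softmax layer would be replaced by a multi-layer network trained on the same fixed 400-dimensional input.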

While DNNa achieves a PER of 41.25%, DNNv alone achieves a PER of 69.36%, showing that
the visual channel carries useful information but is not sufficient by itself to reach a low
error rate. A deep network built in the fused feature space results in a PER of 35.77%, while a
softmax layer only in this feature space yields a PER of 35.83%. This substantial PER gain from the
joint audio-visual representation, even in clean audio conditions, demonstrates the value of visual
information for the phoneme classification task. Interestingly, the deep and the shallow fusion are
roughly on par in terms of PER. Results are summarized in the following table:



Model                        PER       Cross-Entropy
DNNa (Audio Alone)           41.25%    1.53948848
DNNv (Visual Alone)          69.36%    3.24791566
Bimodal (DNN Fusion)         35.77%    1.31047744
Bimodal (SoftMax Fusion)     35.83%    1.31077926


Table 1: Empirical Evaluation on the AV-ASR Studio dataset. 


4 Bilinear Deep Neural Network 


In the previous section the training was done separately on the two modalities. In this section we
address the joint training problem, and introduce the bilinear bimodal DNN.

For a DNN, we note by $\sigma$ the non-linearity function (sigmoid in this paper), $v_\ell$ the input of a
unit, and $h_\ell$ the output of a unit in a layer $\ell$. For a layer $\ell$ we note the dimension of an input
$v_\ell$ by $K_\ell$. As shown in Figure 1, we consider two DNNs, one for each modality, that we fuse at the
level of the decision function. For simplicity of exposition, we assume the same number of layers $L$
($L_1 = L_2 = L$). For the intermediate layers, we have the standard separate networks:

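$$h_\ell^j = \sigma(W_\ell^{j\top} v_\ell^j + b_\ell^j),$$

$$W_\ell^j \in \mathbb{R}^{K_\ell^j \times K_{\ell+1}^j}, \quad b_\ell^j \in \mathbb{R}^{K_{\ell+1}^j}, \quad \ell = 1 \dots L-1, \quad j \in \{1, 2\}.$$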

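The fusion happens at the last hidden layer, where the posteriors capture the correlation between
the intermediate non-linear features of the two modalities produced by the DNN layers, through a
bilinear term. Let $v_L^1 = h_{L-1}^1$ and $v_L^2 = h_{L-1}^2$; the posteriors have the following form:

$$p(y|x^1, x^2) = \frac{\exp\left(v_L^{1\top} W_y v_L^2 + V_y^\top [v_L^1; v_L^2] + b_y\right)}{Z}, \qquad (2)$$

where $Z = \sum_{y' \in \mathcal{Y}} \exp\left(v_L^{1\top} W_{y'} v_L^2 + V_{y'}^\top [v_L^1; v_L^2] + b_{y'}\right)$,
$W_y \in \mathbb{R}^{K_L^1 \times K_L^2}$, $V_y = [V_y^1, V_y^2] \in \mathbb{R}^{K_L^1 + K_L^2}$, $b_y \in \mathbb{R}$,
and $y \in \{1 \dots C\}$.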
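A direct implementation of the bilinear softmax (2) for a single frame might look as follows. It is a sketch under assumptions: the function name and the toy sizes are illustrative, and the class-specific matrices $W_y$ are stored as one dense array of shape (C, K1, K2), which is exactly the cost that motivates the factorization discussed next.

```python
import numpy as np

def bilinear_log_posteriors(v1, v2, W, V1, V2, b):
    """Log-posteriors of the bilinear softmax for one frame.

    v1: (K1,), v2: (K2,), W: (C, K1, K2), V1: (C, K1), V2: (C, K2), b: (C,)
    """
    bilinear = np.einsum("i,cij,j->c", v1, W, v2)   # v1^T W_y v2 for every class y
    scores = bilinear + V1 @ v1 + V2 @ v2 + b       # add the linear terms and the bias
    scores = scores - scores.max()                  # stabilise the exponentials
    return scores - np.log(np.exp(scores).sum())    # subtract log Z

rng = np.random.default_rng(0)
K1, K2, C = 6, 5, 4                                 # toy sizes for illustration
v1, v2 = rng.normal(size=K1), rng.normal(size=K2)
W = rng.normal(size=(C, K1, K2))
V1, V2 = rng.normal(size=(C, K1)), rng.normal(size=(C, K2))
b = rng.normal(size=C)

log_p = bilinear_log_posteriors(v1, v2, W, V1, V2, b)
print(np.exp(log_p).sum())
```

The dense array `W` holds C x K1 x K2 numbers, so the storage and the cost of the bilinear term grow linearly with the number of classes.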


4.1 Factored Bilinear Softmax 

As the number of classes increases, the bilinear model becomes computationally cumbersome, and
we need large training sets to get better estimates of the parameters. In order to decrease the
computational complexity of the model, we propose the use of a factorization of the bilinear term
that is similar to the one in [MZHP], but is motivated in our case by Canonical Correlation Analysis
(CCA) [HSST04]:

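$$W_y = U^1 \mathrm{diag}(w_y) U^{2\top}, \quad y = 1 \dots C, \qquad (3)$$

where $U^1 \in \mathbb{R}^{K_L^1 \times F}$, $U^2 \in \mathbb{R}^{K_L^2 \times F}$, $w_y \in \mathbb{R}^F$, and
$\mathrm{diag}(w_y)$ is a diagonal matrix with $w_y$ on its diagonal. For numerical stability we consider
the Frobenius norm constraint $\|U^j\|_{\mathrm{Fro}} \le \lambda$, $j \in \{1, 2\}$, where $\lambda$ is a
regularization parameter. We note by $F$ the dimension of the fused space, which is typically smaller
than $K_L^1$ and $K_L^2$. Considering the factorization in (3) and maximizing the cross-entropy in the
bilinear model (2), we have:

$$\log p(y|x^1, x^2) = \mathrm{Tr}\left(U^{1\top} v_L^1 v_L^{2\top} U^2 \mathrm{diag}(w_y)\right) + \langle V_y^1, v_L^1 \rangle + \langle V_y^2, v_L^2 \rangle + b_y - \log(Z).$$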

For fixed weights $w_y$, learning $(U^1, U^2)$ corresponds to a class-specific weighted CCA-like learning,
where we are looking for projections that maximize alignment between the intermediate features of
the two modalities in a discriminative way. Deep CCA of [AALB13] shares similarities with this
model.

On the other hand, for fixed $(U^1, U^2)$ we can rewrite the log-posteriors in the following way:

$$\log p(y|x^1, x^2) = \langle w_y, U^{1\top} v_L^1 \odot U^{2\top} v_L^2 \rangle + \langle V_y^1, v_L^1 \rangle + \langle V_y^2, v_L^2 \rangle + b_y - \log(Z),$$

where $\odot$ is the element-wise vector product.

Hence, for fixed $(U^1, U^2)$, we are learning a linear hyperplane in the fused space of dimension
$F$. The projections on $U^1$ and $U^2$ define CCA-like lower dimensional spaces, where the two
modalities are maximally correlated. The fused space is then defined as the element-wise vector
product between two co-occurring vectors in the CCA-like lower dimensional spaces. Hence, we
can think of the last layer of the bilinear bimodal DNN as being an ordinary softmax having the
following input: $[v_L^1, v_L^2, U^{1\top} v_L^1 \odot U^{2\top} v_L^2] \in \mathbb{R}^{K_L^1 + K_L^2 + F}$.
Therefore the decision function is learned based on the individual contributions of the modalities
$v_L^1$ and $v_L^2$, as well as the joint representation produced by the fused space
$U^{1\top} v_L^1 \odot U^{2\top} v_L^2$.
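The equivalence between the factored bilinear term and a linear function of the fused vector can be checked numerically. The sketch below uses small random matrices with hypothetical sizes; it only verifies the algebra and says nothing about trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
K1, K2, F, C = 6, 5, 3, 4            # toy sizes, chosen only for illustration
U1 = rng.normal(size=(K1, F))
U2 = rng.normal(size=(K2, F))
w = rng.normal(size=(C, F))          # one weight vector w_y per class
v1 = rng.normal(size=K1)
v2 = rng.normal(size=K2)

# Fused representation: element-wise product of the two F-dimensional projections.
fused = (U1.T @ v1) * (U2.T @ v2)

# Bilinear scores via the hyperplane form <w_y, fused> ...
scores_factored = w @ fused
# ... and via the explicit matrices W_y = U1 diag(w_y) U2^T.
scores_full = np.array([v1 @ (U1 @ np.diag(w[y]) @ U2.T) @ v2 for y in range(C)])

# Input of the equivalent ordinary softmax layer, of dimension K1 + K2 + F.
softmax_input = np.concatenate([v1, v2, fused])
print(np.allclose(scores_factored, scores_full), softmax_input.shape)
```

Computing `fused` once per frame and reusing it for all classes is what makes the factored form cheaper than evaluating one matrix product per class.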

4.2 Factored Bilinear Softmax With Sharing 

When the classes we would like to predict are organized as the leaves of a tree structure of depth two, 
we can further reduce the computational complexity by sharing weights between leaves having the 
same parent node. This is the case in AV-ASR as the 1328 contextual phoneme states are organized 
as leaves of a tree, where the parent nodes correspond to 42 different phoneme categories. In that 
case we share the bilinear term across leaves having the same parents. By doing so in the case of 
AV-ASR, we are only taking into account the correlations between the audio and the visual channel 
at the phoneme level, rather than on a fine grained grid of contextual states. We can think of this 
sharing as a pooling operation at the phoneme level. More formally, assume that the label set
$\mathcal{Y}$ is partitioned into $G$ non-overlapping groups $\{\mathcal{Y}_g\}_{g = 1 \dots G}$; we assume that:

$$W_y = W_g = U^1 \mathrm{diag}(w_g) U^{2\top}, \quad \forall y \in \mathcal{Y}_g, \quad g = 1 \dots G.$$

Hence we reduce the number of weights to learn for the joint representation from $C \times F$ to $G \times F$.
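To give a sense of scale, the parameter counts of the bilinear term can be worked out for C = 1328 states and G = 42 phonemes. The values F = 200 and a last hidden layer of 200 units per network are assumptions picked for this illustration.

```python
C, G = 1328, 42      # context-dependent states and phoneme groups
F = 200              # assumed dimension of the fused space
K1 = K2 = 200        # assumed size of the last hidden layer of each network

full_bilinear = C * K1 * K2               # one K1 x K2 matrix W_y per class
factored = (K1 + K2) * F + C * F          # shared U1 and U2, plus one w_y per class
factored_shared = (K1 + K2) * F + G * F   # shared U1 and U2, plus one w_g per phoneme

print(full_bilinear, factored, factored_shared)
```

Under these assumptions the class-specific weights shrink from C x F = 265,600 to G x F = 8,400, while the projections $U^1$ and $U^2$ cost the same in both factored variants.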

5 Back-propagation with the Factored Bilinear DNN with Sharing 

In this section we give the back-propagation algorithm and the update rules for the bilinear DNN
with sharing (bi$^2$-DNN-wS). Recall that our classes have a tree structure with leaves $y$ and parent
nodes $g$; a training example is therefore labeled by its leaf label $y$ (States) as well as its parent node
$g$ (Phonemes): $(x^1, x^2, y, g)$, $y \in \{1 \dots C\}$, and $g \in \{1 \dots G\}$. We use the notation $g(y)$ to
note the group to which $y$ belongs, and we set $\mathrm{Root}_{g(y)} = 1$ and $\mathrm{Root}_g = 0$ for
$g = 1 \dots G$, $g \neq g(y)$.
For the bilinear softmax with sharing, we keep track of the errors at the level of the labels (States),
as well as at the level of the groups (Phonemes):

$$\delta_L^k = t_k - p(k|v_L^1, v_L^2), \quad k = 1 \dots C,$$

$$\delta_G^g = \mathrm{Root}_g - \sum_{k \in \mathcal{Y}_g} p(k|v_L^1, v_L^2), \quad g = 1 \dots G, \quad (\delta_G \in \mathbb{R}^G).$$


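Let $W = [w_1, \dots, w_G] \in \mathbb{R}^{F \times G}$. The gradients of the parameters of the bilinear
softmax are given by:

$$\frac{\partial \mathcal{E}}{\partial b_k} = \delta_L^k, \qquad \frac{\partial \mathcal{E}}{\partial V_k^j} = \delta_L^k v_L^j, \qquad \frac{\partial \mathcal{E}}{\partial w_g} = \delta_G^g \left(U^{1\top} v_L^1 \odot U^{2\top} v_L^2\right),$$

$$\frac{\partial \mathcal{E}}{\partial U^1} = v_L^1 v_L^{2\top} U^2 \mathrm{diag}(W \delta_G), \qquad \frac{\partial \mathcal{E}}{\partial U^2} = v_L^2 v_L^{1\top} U^1 \mathrm{diag}(W \delta_G).$$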


For the layer right before the Bilinear softmax, we have a double projection to the first modality 
network (audio stream) and to the second modality network (visual stream). 

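We need to compute:

$$W_g = U^1 \mathrm{diag}(w_g) U^{2\top}, \quad g = 1 \dots G,$$

$$m_g^{2 \to 1} = W_g v_L^2 \in \mathbb{R}^{K_L^1}, \qquad m_g^{1 \to 2} = W_g^\top v_L^1 \in \mathbb{R}^{K_L^2},$$

$$M^{2 \to 1} = [m_1^{2 \to 1}, \dots, m_G^{2 \to 1}], \qquad M^{1 \to 2} = [m_1^{1 \to 2}, \dots, m_G^{1 \to 2}].$$

Let $V^j = [V_1^j \dots V_C^j]$, $j \in \{1, 2\}$; hence the errors we propagate to each network have the
following form:

$$\delta_{L-1}^1 = \frac{\partial \mathcal{E}}{\partial v_L^1} = M^{2 \to 1} \delta_G + V^1 \delta_L, \qquad (4)$$

$$\delta_{L-1}^2 = \frac{\partial \mathcal{E}}{\partial v_L^2} = M^{1 \to 2} \delta_G + V^2 \delta_L. \qquad (5)$$

Note that the errors now have an additional term, $M^{2 \to 1} \delta_G$ and $M^{1 \to 2} \delta_G$, respectively.
We can think of those terms as messages passed between networks through the bilinear term. In that
way, one network influences the weights of the other one. The rest of the updates follow
standard back-propagation in both networks; we give them here for completeness. Let
$u_\ell^j = W_\ell^{j\top} v_\ell^j + b_\ell^j$; then finally we have:

$$\frac{\partial \mathcal{E}}{\partial W_\ell^j} = v_\ell^j \left(\mathrm{diag}(\sigma'(u_\ell^j)) \delta_\ell^j\right)^\top, \qquad \frac{\partial \mathcal{E}}{\partial b_\ell^j} = \mathrm{diag}(\sigma'(u_\ell^j)) \delta_\ell^j, \qquad \delta_{\ell-1}^j = W_\ell^j \mathrm{diag}(\sigma'(u_\ell^j)) \delta_\ell^j,$$

for $j \in \{1, 2\}$ and $\ell = L-1 \dots 1$, where $\delta_{L-1}^1$ and $\delta_{L-1}^2$ are given in equations (4) and (5).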

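For each variable $\theta$, we have an update rule $\theta \leftarrow \theta + \rho \frac{\partial \mathcal{E}}{\partial \theta}$,
where $\rho$ is the learning rate. For $U^1$ and $U^2$, we need to keep control of the Frobenius norm by
following the gradient step with a projection onto the Frobenius ball:
$U^j \leftarrow U^j \min\left(1, \frac{\lambda}{\|U^j\|_{\mathrm{Fro}}}\right)$, $j \in \{1, 2\}$.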
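The gradient of the shared bilinear softmax with respect to $U^1$, the error sent to the first network, and the Frobenius projection can be checked with finite differences. The sketch below is a verification aid under assumptions: toy sizes, random parameters, a hypothetical grouping of six leaf classes into two parents, and illustrative names such as `M_2to1` for the message matrix from the second network to the first.

```python
import numpy as np

rng = np.random.default_rng(0)
K1, K2, F, C, G = 5, 4, 3, 6, 2               # toy sizes
group = np.array([0, 0, 0, 1, 1, 1])          # g(y): parent group of each leaf class
U1, U2 = rng.normal(size=(K1, F)), rng.normal(size=(K2, F))
W = rng.normal(size=(F, G))                   # columns are w_1 ... w_G
V1, V2 = rng.normal(size=(K1, C)), rng.normal(size=(K2, C))
b = rng.normal(size=C)
v1, v2 = rng.normal(size=K1), rng.normal(size=K2)
y = 4                                         # true leaf label of this frame

def log_posteriors(U1, U2, v1):
    fused = (U1.T @ v1) * (U2.T @ v2)
    scores = (W.T @ fused)[group] + V1.T @ v1 + V2.T @ v2 + b
    scores -= scores.max()
    return scores - np.log(np.exp(scores).sum())

p = np.exp(log_posteriors(U1, U2, v1))
delta_L = -p
delta_L[y] += 1.0                             # t_k - p(k | v1, v2)
# Root_g minus the summed posterior of the leaves in group g.
delta_G = np.array([delta_L[group == g].sum() for g in range(G)])

# Analytic gradient of E = log p(y | v1, v2) with respect to U1.
grad_U1 = np.outer(v1, v2) @ U2 @ np.diag(W @ delta_G)
# Error sent to the first network: message through the bilinear term plus the linear term.
M_2to1 = np.stack([U1 @ np.diag(W[:, g]) @ U2.T @ v2 for g in range(G)], axis=1)
err_1 = M_2to1 @ delta_G + V1 @ delta_L

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (f(xp) - f(xm)) / (2 * eps)
    return g

num_U1 = numeric_grad(lambda U: log_posteriors(U, U2, v1)[y], U1)
num_v1 = numeric_grad(lambda v: log_posteriors(U1, U2, v)[y], v1)
print(np.allclose(grad_U1, num_U1, atol=1e-5), np.allclose(err_1, num_v1, atol=1e-5))

def project_frobenius(U, lam):
    """Rescale U so that its Frobenius norm is at most lam."""
    return U * min(1.0, lam / np.linalg.norm(U))
```

A training step for $U^1$ would then be a gradient ascent update followed by `project_frobenius`, matching the update rule stated above.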

Remark 1. For the bilinear softmax without sharing the update rules are similar ($\delta_G$ is replaced
by $\delta_L$).

6 Combining Posteriors from Bimodal and Bilinear Bimodal Networks

We experiment with various factored bi$^2$-DNN-wS architectures, initialized at random, on the IBM
AV-ASR Large Vocabulary Studio Dataset. We use the following notation for the architecture of
the bilinear network: [arch_a | arch_v | F], where arch_a and arch_v are the architectures of the audio
and the visual network respectively, and F is the dimension of the fused space. We consider
architectures of increasing complexity:

Arch  = [360, 500, 500, 200, 1328 | 540, 500, 500, 200, 1328 | F = 200],
Arch1 = [360, 600, 600, 400, 100, 1328 | 540, 600, 600, 400, 100, 1328 | F = 100],
Arch2 = [360, 500, 500, 500, 500, 500, 200, 1328 | 540, 500, 500, 500, 500, 500, 200, 1328 | F = 200].

In all our experiments we set $\lambda = 2$. Recall that the bimodal DNN using the separate training
paradigm introduced in Section 3 achieves 35.83% PER. As shown in Table 2, each architecture
alone does not improve on the bimodal DNN, but averaging the posteriors of the three architectures
yields a small gain (35.54% PER). A gain of 1.8% absolute (34.03% PER) is obtained by averaging the
posteriors of the bimodal and the bilinear bimodal networks, showing that the bilinear networks
make errors that are uncorrelated with those of the bimodal network.



Model                               PER
Arch                                38.89%
Arch1                               39.01%
Arch2                               38.36%
Bimodal                             35.83%
Arch + Arch1 + Arch2                35.54%
Arch + Arch1 + Arch2 + Bimodal      34.03%


Table 2: Empirical evaluation on the AV-ASR Studio dataset. 
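The posterior combination can be illustrated with a toy example. The text says the posteriors are averaged; an unweighted arithmetic mean is assumed here, and the function name and the numbers are made up.

```python
import numpy as np

def average_posteriors(posterior_list):
    """Arithmetic mean of per-frame posteriors from several models; each array is (frames, classes)."""
    return np.mean(np.stack(posterior_list), axis=0)

# Toy posteriors from two hypothetical models for 2 frames and 3 classes.
p_a = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.5, 0.3]])
p_b = np.array([[0.2, 0.7, 0.1],
                [0.1, 0.3, 0.6]])

combined = average_posteriors([p_a, p_b])
predictions = combined.argmax(axis=1)   # class decision per frame after combination
print(combined, predictions)
```

Each row of the average is still a valid distribution, and a frame's decision can change when the models disagree, which is where a committee with weakly correlated errors can help.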

7 Conclusion 

In this paper we have studied deep multimodal learning for the task of phonetic classification from
audio and visual modalities. We demonstrate that, even in clean acoustic conditions, using the visual
channel in addition to speech results in significantly improved classification performance. A bilinear
bimodal DNN is introduced which leverages correlation between the audio and visual modalities,
and leads to further error rate reduction.